CHAPTER 8 Getting Your Data into the Computer 105

Collecting categorical data in your research database

Setting up your data collection forms and database tables for categorical data requires more thought than you may expect. You may assume you already know how to record and enter categorical data. You just type in the values (such as “United States,” “nurse,” or “Stage I”), right? Wrong! (But wouldn’t it be nice if it were that simple?) The following sections look at some of the issues you have to address when storing categorical values as research data.

Carefully coding categories

The first issue you need to settle is how to code the categories. How are you going to store the values in the research database? Do you want to enter the type of care provider as nurse, physician, or social worker; or as N, P, or SW; or as 1 = nurse, 2 = physician, and 3 = social worker; or in some other manner? Most modern statistical software can analyze categorical data with any of these representations, but it is easiest for the analyst if you code the variables using numbers to represent the categories. Software like SPSS, SAS, and R lets you attach a text label to each numerical code (for example, labeling 1 so that it displays as Nurse), so you can store categories as numbers while still showing what each code means on statistical output. You can also attach a descriptive label to the variable itself. In general, best practice is to set conventions, apply them consistently, and make sure the content and meaning of each variable are documented.
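As a rough illustration of storing numerical codes while reporting text labels, here is a minimal Python sketch. It is not how SPSS, SAS, or R implement value labels; the names CARE_PROVIDER_LABELS and label_counts and the sample data are invented for this example.

```python
# Hypothetical coding scheme for the care-provider variable,
# following the 1 = nurse, 2 = physician, 3 = social worker example.
CARE_PROVIDER_LABELS = {1: "Nurse", 2: "Physician", 3: "Social worker"}

# Made-up stored data: one numerical code per participant.
care_provider = [1, 3, 2, 1, 1]


def label_counts(codes, labels):
    """Count how often each code occurs, reporting categories by label."""
    counts = {label: 0 for label in labels.values()}
    for code in codes:
        counts[labels[code]] += 1
    return counts


table = label_counts(care_provider, CARE_PROVIDER_LABELS)
print(table)  # {'Nurse': 3, 'Physician': 1, 'Social worker': 1}
```

The stored column contains only numbers, and the labels live in one place, so correcting a label never requires touching the data.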

Nothing is worse than having to deal with a data set in which a categorical variable has been stored with numerical codes, but there is no key to the codes and the person who created the data set is no longer available. This is why maintaining a data dictionary (described later in this chapter in “Creating a File that Describes Your Data File”) is a critical step for ensuring you analyze your research data properly.
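One way to picture the key to the codes is as a small lookup structure kept alongside the data. The Python sketch below assumes a made-up layout for a data dictionary (DATA_DICTIONARY) and a hypothetical helper, undocumented_codes, that reports any stored value with no documented meaning.

```python
# Hypothetical data dictionary: one entry per variable, listing what
# the variable holds and what each numerical code means.
DATA_DICTIONARY = {
    "care_provider": {
        "description": "Type of care provider",
        "codes": {1: "Nurse", 2: "Physician", 3: "Social worker"},
    },
}


def undocumented_codes(variable, values, dictionary):
    """Return, in sorted order, the distinct values that have no key."""
    known = dictionary[variable]["codes"]
    return sorted({value for value in values if value not in known})


# Made-up column in which someone entered a 4 that was never documented.
stray = undocumented_codes("care_provider", [1, 2, 4, 3, 4], DATA_DICTIONARY)
print(stray)  # [4]
```

Running a check like this before analysis flags codes that nobody can interpret while the person who entered them may still be around to ask.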

Microsoft Excel doesn’t care whether you type a word or a number in a cell, which can create problems when storing data. You can enter Type of Caregiver as N for the first subject, nurse for the second, NURSE for the third, 1 for the fourth, and Nurse for the fifth, and Excel won’t stop you or raise an error. Statistical programs like R would treat each of these entries as a separate, unique category. Even worse, you may inadvertently add a blank space in the cell before or after the text, which will be treated as yet another category. Case sensitivity (whether uppercase and lowercase versions of the same text are treated as different values) can also affect queries. In Excel, avoid using autocomplete, and enter all levels of categorical variables as numerical codes (which can be decoded using your data dictionary).
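To see why inconsistent entries matter, the following Python sketch counts the entries from the example above as exact text, the way software that compares character values literally would. The sample list, the TO_CODE cleanup map, and the to_code helper are assumptions made for illustration, not a feature of Excel or R.

```python
from collections import Counter

# Entries as they might be typed into one spreadsheet column.
# The last entry has a leading blank space.
raw_entries = ["N", "nurse", "NURSE", "1", "Nurse", " nurse"]

# Compared as exact text, every entry is its own category.
raw_counts = Counter(raw_entries)

# Hypothetical cleanup map: normalized spelling -> numerical code.
TO_CODE = {"n": 1, "nurse": 1, "1": 1}


def to_code(entry):
    """Trim blank spaces, ignore case, then look up the numerical code."""
    return TO_CODE[entry.strip().lower()]


clean_counts = Counter(to_code(entry) for entry in raw_entries)

print(len(raw_counts))  # 6
print(clean_counts)     # Counter({1: 6})
```

Six entries that all mean nurse show up as six categories until they are normalized. An entry missing from the cleanup map raises a KeyError here, which is deliberate: an unrecognized spelling should be reviewed by a person, not guessed at.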